Configurable vector length computer processor

ABSTRACT

A processor core comprises one or more vector units operable to change between a fine-grained vector mode having a shorter maximum vector length and a coarse-grained vector mode having a longer maximum vector length. Changing vector modes comprises halting all instruction stream execution in the core, flushing one or more registers in a register space, reconfiguring one or more vector registers in the register space, and restarting instruction execution in the core.

FIELD OF THE INVENTION

The invention relates generally to vector computer processors, and more specifically in one embodiment to a configurable vector length computer processor.

LIMITED COPYRIGHT WAIVER

A portion of the disclosure of this patent document contains material to which the claim of copyright protection is made. The copyright owner has no objection to the facsimile reproduction by any person of the patent document or the patent disclosure, as it appears in the U.S. Patent and Trademark Office file or records, but reserves all other rights whatsoever.

BACKGROUND

Most general purpose computer systems are built around a general-purpose processor, which is typically an integrated circuit operable to perform a wide variety of operations useful for executing a wide variety of software. The processor is able to perform a fixed set of instructions, which collectively are known as the instruction set for the processor. A typical instruction set includes a variety of types of instructions, including arithmetic, logic, and data instructions.

In more sophisticated computer systems, multiple processors are used, and one or more processors run software that is operable to assign tasks to other processors or to split up a task so that it can be worked on by multiple processors at the same time. In such systems, the data being worked on is typically stored in memory that is either centralized or split up among the different processors working on a task.

Instructions from the instruction set of the computer's processor or processors that are chosen to perform a certain task form a software program that can be executed on the computer system. Typically, the software program is first written in a high-level language such as “C” that is easier for a programmer to understand than the processor's instruction set, and a program called a compiler converts the high-level language program code to processor-specific instructions.

In multiprocessor systems, the programmer or the compiler will usually look for tasks that can be performed in parallel, such as calculations where the data used to perform a first calculation are not dependent on the results of certain other calculations such that the first calculation and other calculations can be performed at the same time. The calculations performed at the same time are said to be performed in parallel, and can result in significantly faster execution of the program. Although some programs such as web browsers and word processors don't consume a high percentage of even a single processor's resources and don't have many operations that can be performed in parallel, other operations such as scientific simulation can often run hundreds or thousands of times faster in computers with thousands of parallel processing nodes available.

Multiple operations can also be performed at the same time using one or more vector processors, which perform an operation on multiple data elements at the same time. For example, rather than an instruction that adds two numbers together to produce a third number, a vector instruction may add elements from a 64-element vector to elements from a second 64-element vector to produce a third 64-element vector, where each element of the third vector is the sum of the corresponding elements in the first and second vectors.

In this example, the vector registers each hold 64 elements, so the vector length is said to be 64. The vector processor can handle sets of data smaller than 64 by using a vector length register specifying that some number fewer than 64 elements are to be processed, or can handle sets of data larger than 64 elements by using multiple vector operations to process all elements in the data set, such as by using a program loop.
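The handling of data sets shorter or longer than the vector length can be illustrated with a minimal software sketch. The names MAX_VL, vector_add, and strip_mined_add are hypothetical and are not part of the described processor; the sketch only models the behavior of a vector length register and a program loop under those assumptions.

```python
MAX_VL = 64  # maximum vector length from the example above (assumed constant)


def vector_add(a, b, vl):
    """Model one vector add that processes only the first vl elements."""
    if not 0 <= vl <= MAX_VL:
        raise ValueError("vector length must be between 0 and MAX_VL")
    return [a[i] + b[i] for i in range(vl)]


def strip_mined_add(x, y):
    """Add two equal-length sequences of any size, at most MAX_VL elements per vector add."""
    if len(x) != len(y):
        raise ValueError("operands must have the same length")
    result = []
    for start in range(0, len(x), MAX_VL):
        # Value that would be written to the vector length register for this pass.
        vl = min(MAX_VL, len(x) - start)
        result.extend(vector_add(x[start:start + vl], y[start:start + vl], vl))
    return result
```

In this sketch a 150-element data set is processed as two full 64-element passes followed by one 22-element pass.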

The vectors in some further examples do not operate on elements that are sequential in memory, but instead operate on elements that are spaced some distance apart, such as on certain elements of a large array for scientific computing and modeling applications. This distance between elements in a vector is referred to as the stride, such that sequential words from memory have a stride of one, whereas a vector comprising every sixteenth element in memory has a stride of 16.
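A strided vector load can be sketched as follows. The function name load_strided and the list standing in for word-addressed memory are assumptions made for illustration only.

```python
def load_strided(memory, base, stride, vl):
    """Gather vl elements starting at index base, stepping by stride words."""
    return [memory[base + i * stride] for i in range(vl)]


# A list standing in for word-addressed memory (illustrative only).
memory = list(range(100))

sequential = load_strided(memory, base=0, stride=1, vl=4)    # consecutive words
every_16th = load_strided(memory, base=0, stride=16, vl=4)   # every sixteenth word
```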

Vector processing provides other benefits to program efficiency, but at the cost of significant load or startup time relative to a scalar operation. Although the vectors must be completely loaded from memory before functions can be performed on the elements, other steps such as checking for variable independence need only be performed once for an entire vector operation. Instruction and coding efficiency are also improved with vector operations, as is memory access where the vector has a known or consistent memory access pattern. Vector processor design choices such as vector length consider these efficiencies and tradeoffs in an attempt to provide both good scalar operation performance and efficient vector operation.

SUMMARY

Some embodiments of the invention comprise a processor core that comprises one or more vector units operable to change between a fine-grained vector mode having a shorter maximum vector length and a coarse-grained vector mode having a longer maximum vector length. Changing vector modes comprises halting all instruction stream execution in the core, flushing one or more registers in a register space, reconfiguring one or more vector registers in the register space, and restarting instruction execution in the core.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 shows a reconfigurable vector space supporting four streams and a vector length of 16, consistent with an example embodiment of the invention.

FIG. 2 shows a reconfigurable vector space supporting 32 streams and a vector length of one, consistent with an example embodiment of the invention.

FIG. 3 shows a vector processor having configurable vector modes, consistent with an example embodiment of the invention.

DETAILED DESCRIPTION

In the following detailed description of example embodiments of the invention, reference is made to specific examples by way of drawings and illustrations. These examples are described in sufficient detail to enable those skilled in the art to practice the invention, and serve to illustrate how the invention may be applied to various purposes or applications. Other embodiments of the invention exist and are within the scope of the invention, and logical, mechanical, electrical, and other changes may be made without departing from the scope or subject of the present invention. Features or limitations of various embodiments of the invention described herein, however essential to the example embodiments in which they are incorporated, do not limit the invention as a whole, and any reference to the invention, its elements, operation, and application do not limit the invention as a whole but serve only to define these example embodiments. The following detailed description does not, therefore, limit the scope of the invention, which is defined only by the appended claims.

Vector processor architectures often include vector registers having a fixed number of entries, each vector register capable of holding a single vector. Vector functional units, such as an add/subtract unit, a multiply unit, a divide unit, and logic operation units, are either dedicated to serving vector operations or are shared with scalar operations. Scalar registers are also used in some vector operations, such as where every element of a vector is multiplied by a scalar number. A processor might have, for example, eight vector registers with 64 elements per register, where each element is a 64-bit word.

It is desirable in some applications to have vector lengths that are longer, while in other applications greater performance could be achieved if vector lengths were shorter or if the processor functioned more like a scalar processor. One embodiment of the invention seeks to address problems such as this by providing a reconfigurable processor core, such as where a more vectorized and a less vectorized configuration are available within the same processor core and can be selected to improve application execution efficiency.

In one such example, a processor chip contains 32 cores, where each core is capable of operating in either a vector threaded mode supporting four streams having a maximum vector length of 16, or a scalar threaded mode supporting 32 streams of a maximum vector length of one. Each mode has the same instruction set architecture, same instruction issue rate, and same instruction processing performance, but will provide different application performance based on the parallelization or vectorization that can be achieved for a given application.

In one such example illustrated in FIGS. 1 and 2, the vector registers and address registers allocated to different numbers of instruction streams are shown, demonstrating how an example register space is configured to facilitate changing vector modes. In this example, 3,072 registers are organized as 96 registers with 32 elements each. FIG. 1 shows the example register space configured to support four streams having a maximum vector length of 16, whereas FIG. 2 illustrates the same register space configured to support 32 streams with a maximum vector length of one.

Vector registers allocated to each of four different instruction streams of the four-stream 16-element vector configuration are shown at 101, each stream being allocated 32 registers having 16 elements each, such that there is a maximum vector length of 16. Address registers for each stream are allocated in register space 102, but only consume two elements of 32 registers per stream; the remaining register space that is crossed out is unused in this vector mode.

In FIG. 2, the same vector register space is configured such that each of 32 streams is allocated vector register space having 32 elements each, for a total of 1,024 registers. The remaining 2,048 registers are allocated as address registers as shown at 202, such that each of the 32 streams is allocated 64 address registers.
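The register counts given for FIGS. 1 and 2 can be checked with a short calculation. This is an illustrative sketch only: the helper vector_elements is hypothetical, and reading the 32 elements per stream in FIG. 2 as 32 single-element vector registers is an assumption. The address register allocation of FIG. 1 is not modeled.

```python
# Total register space from the example: 96 registers of 32 elements each.
TOTAL_ELEMENTS = 96 * 32


def vector_elements(streams, regs_per_stream, max_vl):
    """Register elements consumed by vector registers in a given mode."""
    return streams * regs_per_stream * max_vl


# FIG. 1: four streams, 32 vector registers per stream, 16 elements each.
fig1_vector = vector_elements(streams=4, regs_per_stream=32, max_vl=16)

# FIG. 2: 32 streams, assumed to hold 32 single-element vector registers each.
fig2_vector = vector_elements(streams=32, regs_per_stream=32, max_vl=1)

# FIG. 2: 64 address registers for each of the 32 streams.
fig2_address = 32 * 64

print(TOTAL_ELEMENTS, fig1_vector, fig2_vector, fig2_address)
```

Under these assumptions the vector and address allocations of FIG. 2 together account for the entire 3,072-element register space.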

In this example embodiment, the address registers and vector registers are a part of a processor core, as shown in FIG. 3. Here, the XPipe element is the execution pipeline, as shown at 301, and includes the address register/vector register space of FIGS. 1 and 2, which is shown at 302 in FIG. 3. The MPipe, or memory pipeline that includes the load/store unit of the processor, is shown at 303, and the IPipe, or instruction pipeline, is shown at 304. The instruction pipeline includes the instruction buffers and cache, and the instruction fetch and issue logic.

To change modes between fine-grained parallel applications that benefit from running in a 32-stream mode and coarse-grained parallel applications that benefit from the longer vector length of the 4-stream mode, the processor core quiets all executing threads in the core being reconfigured, and flushes the registers. The registers and instruction pipelines are reloaded under the new vector/stream mode, and execution is restarted.

Changing modes therefore involves repartitioning the register space and reassignment of registers to different streams, or between vector and address register allocation, depending on the embodiment being practiced. The actual register space remains the same, as is illustrated in the example of FIGS. 1 and 2, and the IPipe system remains the same but switches between four and 32 instruction streams based on the selected mode. A variety of other necessary or optional changes, such as changing a maxVL or maximum vector length register to reflect the new configured maximum vector length, are also employed in some embodiments, and are within the scope of the invention.
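The mode change sequence described above (halt, flush, reconfigure, restart) can be modeled in software as a minimal sketch. The class Core, the MODES table, and the mode names "coarse" and "fine" are hypothetical, and the flush is modeled simply as discarding register contents, which is an assumption rather than a description of the hardware.

```python
from dataclasses import dataclass, field

# Stream count and maximum vector length for each mode, from the example embodiment.
MODES = {
    "coarse": {"streams": 4, "max_vl": 16},
    "fine": {"streams": 32, "max_vl": 1},
}


@dataclass
class Core:
    mode: str = "coarse"
    running: bool = True
    registers: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    @property
    def streams(self):
        return MODES[self.mode]["streams"]

    @property
    def max_vl(self):
        # Stands in for the maxVL register mentioned above.
        return MODES[self.mode]["max_vl"]

    def change_mode(self, new_mode):
        if new_mode not in MODES:
            raise ValueError(f"unknown mode: {new_mode}")
        self.running = False          # halt all instruction streams in the core
        self.log.append("halt")
        self.registers.clear()        # flush (modeled as discarding contents)
        self.log.append("flush")
        self.mode = new_mode          # repartition register space, update maxVL
        self.log.append("reconfigure")
        self.running = True           # restart instruction execution
        self.log.append("restart")


core = Core()
core.registers["v0"] = [0] * 16
core.change_mode("fine")
```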

The processor of this example can therefore be configured for fine-grained or coarse-grained parallelism on the fly, even within an executing application. The ability to configure the processor core on the fly, even within a job or application, provides greater flexibility and efficiency in execution than prior systems could provide. Further, the ability to switch modes on a core-by-core basis rather than on a system-by-system basis or chip-by-chip basis enables configuration of individual cores to best suit the applications assigned to those specific cores. For example, a processor chip containing 32 cores can configure 28 cores to work on a coarse-grained parallel application using a vector length of 16, while the remaining four cores execute fine-grained threads that do not lend themselves to vector parallelization as well.
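The core-by-core configuration in this example can be tallied with a brief sketch. The names core_modes and MODE_STREAMS are hypothetical, and the stream counts per mode are taken from the four-stream and 32-stream modes of the example embodiment.

```python
# Streams supported per core in each mode (from the example embodiment).
MODE_STREAMS = {"coarse": 4, "fine": 32}

# One chip with 32 cores: 28 in the coarse-grained mode, four in the fine-grained mode.
core_modes = ["coarse"] * 28 + ["fine"] * 4

coarse_streams = sum(MODE_STREAMS[m] for m in core_modes if m == "coarse")
fine_streams = sum(MODE_STREAMS[m] for m in core_modes if m == "fine")
total_streams = coarse_streams + fine_streams
```

Under these assumptions the chip runs 112 vector streams of maximum vector length 16 alongside 128 streams of maximum vector length one.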

Although specific embodiments have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that any arrangement which is calculated to achieve the same purpose may be substituted for the specific embodiments shown. This application is intended to cover any adaptations or variations of the example embodiments of the invention described herein. It is intended that this invention be limited only by the claims, and the full scope of equivalents thereof.

1. A processor core, comprising: one or more vector units operable to change between a fine-grained vector mode having a shorter maximum vector length and a coarse-grained vector mode having a longer maximum vector length.
2. The processor core of claim 1, wherein the one or more vector units is further operable to change the number of instruction streams.
3. The processor core of claim 1, wherein the one or more vector units are operable to change the maximum vector length by changing vector register allocation in a register space.
4. The processor core of claim 1, wherein the one or more vector units are operable to change the maximum vector length by changing instruction issue mode.
5. The processor core of claim 1, wherein a processor comprises multiple processor cores each having one or more vector units, and the multiple processor cores are independently operable to change vector modes.
6. The processor core of claim 1, wherein changing vector modes comprises halting all instruction stream execution in the core, flushing one or more registers in a register space, reconfiguring one or more vector registers in the register space, and restarting instruction execution in the core.
7. The processor core of claim 1, wherein changing vector modes comprises using the same instruction set architecture in different vector modes.
8. A multiprocessor computer system, comprising: a plurality of processing nodes, each node comprising one or more local processor cores, wherein the one or more local processor cores each comprise one or more vector units operable to change between a fine-grained vector mode having a shorter maximum vector length and a coarse-grained vector mode having a longer maximum vector length.
9. The multiprocessor computer system of claim 8, wherein the one or more vector units is further operable to change the number of instruction streams.
10. The multiprocessor computer system of claim 8, wherein the one or more vector units are operable to change the maximum vector length by changing vector register allocation in a register space.
11. The multiprocessor computer system of claim 8, wherein the one or more vector units are operable to change the maximum vector length by changing instruction issue mode.
12. The multiprocessor computer system of claim 8, wherein changing vector modes comprises halting all instruction stream execution in the core, flushing one or more registers in a register space, reconfiguring one or more vector registers in the register space, and restarting instruction execution in the core.
13. The multiprocessor computer system of claim 8, wherein changing vector modes comprises using the same instruction set architecture in different vector modes.
14. The multiprocessor computer system of claim 8, wherein the one or more local processor cores are operable to independently change vector modes.
15. A method of operating a vector computer processor, comprising: changing between a fine-grained vector mode having a shorter maximum vector length and a coarse-grained vector mode having a longer maximum vector length.
16. The method of operating a vector computer processor of claim 15, wherein the change in vector mode is initiated by one or more of an application, an operating system, a batch system, or a processor core.
17. The method of operating a vector computer processor of claim 15, wherein changing vector modes comprises at least one of changing the number of instruction streams, changing vector register allocation in a register space, changing instruction issue mode.
18. The method of operating a vector computer processor of claim 15, wherein changing vector modes comprises halting all instruction stream execution in the core, flushing one or more registers in a register space, reconfiguring one or more vector registers in the register space, and restarting instruction execution in the core.
19. The method of operating a vector computer processor of claim 15, wherein the vector computer processor comprises multiple processor cores each having one or more vector units, and the multiple processor cores are independently operable to change vector modes.
20. The method of operating a vector computer processor of claim 15, wherein changing vector modes comprises using the same instruction set architecture in different vector modes.